Abstract

In this paper, we present a framework for parsing video events with a stochastic Temporal And-Or Graph
(T-AOG) and for unsupervised learning of the T-AOG from video. This T-AOG represents a stochastic event
grammar. The alphabet of the T-AOG consists of a set of grounded spatial relations including the poses
of agents and their interactions with objects in the scene. The terminal nodes of the T-AOG are atomic
actions which are specified by a number of grounded relations over image frames. An And-node represents
a sequence of actions. An Or-node represents a number of alternative ways of such concatenations. The
And-Or nodes in the T-AOG can generate a set of valid temporal configurations of atomic actions, which can
be equivalently represented as a stochastic context-free grammar (SCFG). For each And-node we model the
temporal relations of its children nodes to distinguish events with similar structures but different temporal
patterns and to interpolate missing portions of events. This makes the T-AOG grammar context-sensitive.
We propose an unsupervised learning algorithm to learn the atomic actions, the temporal relations and the
And-Or nodes under the information projection principle in a coherent probabilistic framework. We also
propose an event parsing algorithm based on the T-AOG which can understand events, infer the goal of
agents, and predict their plausible intended actions. In comparison with existing methods, our paper makes
the following contributions: i) We represent events by a T-AOG with hierarchical compositions of events
and the temporal relations between the sub-events; ii) We learn the grammar, including atomic actions and
temporal relations, automatically from the video data without manual supervision; iii) Our algorithm infers
the goal of agents and predicts their intents by a top-down process, handles event insertion and multi-agent
events, keeps all possible interpretations of the video to preserve the ambiguities, and achieves the globally
optimal parsing solution in a Bayesian framework; iv) The algorithm uses event context to improve the
detection of atomic actions and to segment and recognize objects in the scene. Extensive experiments, including
indoor and outdoor scenes and single-agent and multi-agent events, are conducted to validate the effectiveness of
the proposed approach.

Keywords:  Temporal  And-Or  Graph  (T-AOG),  Event  Parsing,  Unsupervised  learning,  Goal  prediction, 
Information  projection. 


1.  Introduction 

1.1.  Motivation  and  Objective 

Cognitive studies [1] show that humans have a strong inclination to interpret observed behaviors
of others as goal-directed actions. In this paper, we take such a teleological stance for
understanding events in surveillance video, in which people are assumed to be rational agents [2]
whose actions are planned to achieve certain goals. In this way, we infer the underlying goals and
predict the next actions on the fly as the events unfold.


Imagine an office scene, where an agent picks up a cup and walks to a desk on which there is a tea
box. One might infer that his goal is to make a cup of tea, and predict that his next action is to
put a tea bag in the cup. But instead, he picks up the phone on the desk; one then infers that his
goal has been interrupted by an incoming call. After the call, he walks to a dispenser, and his
action is obscured due to our viewing angle. After some time, he is observed drinking. One can now
infer that he poured water into the cup during the occluded time interval.

Daily videos contain a large variety of actions and events, which are defined through gestures and
interactions between agents and environments. These action and event concepts constitute a large
portion of human visual knowledge; therefore, learning from video data is a promising way to acquire
rich common sense knowledge.

To achieve the above event understanding capability, we need to address the following problems:

i) Events are compositional. An event can often consist of a sequence of actions and can be
executed in multiple ways. Therefore a good representation must be hierarchical and account
for temporal relations between sub-events.

ii) An inference algorithm must deal with event insertions, interruptions, multi-agent events
and agent-object interactions. The inference process must also preserve the ambiguities
both in the lower level atomic action detection and in the higher level event recognition to
achieve a globally optimal solution.

iii) A learning algorithm must discover the structure of the events from video data with
minimal user supervision.

1.2.  Overview  of  our  work 

In this paper, we represent events by a Temporal And-Or Graph (T-AOG). The AOG was first
introduced to computer vision in [3] and [4] for modeling visual objects, and has been used in [5]
to analyze sports videos.

The T-AOG consists of a set of terminal nodes, And-nodes and Or-nodes. A terminal node specifies a
contextual atomic action defined by a set of spatial relations (e.g. agent poses, agent’s
interaction with objects in the scene) grounded in the images. The And-nodes and Or-nodes represent
verb concepts and are composed of the atomic actions. And-nodes represent temporal compositions of
their children nodes. Or-nodes represent alternative ways to realize events, where each alternative
has an associated probability to account for its branching frequency. With recursively defined
And-nodes and Or-nodes, the T-AOG specifies a stochastic context-free grammar (SCFG) whose language
is the set of valid configurations of events. For each And-node, we model the temporal relations of
its children nodes to distinguish events with similar structures but different temporal patterns and
to interpolate missing portions of events. This makes the T-AOG grammar context-sensitive.

We propose an inference algorithm for T-AOG based on the Earley Parser [6]. It finds the most
likely parse graph by iterative bottom-up detection and top-down inference, similar to the image
parsing algorithm in [7]. Our inference algorithm is designed to handle interleaving events (e.g.
event A interrupts event B) and online prediction of future events. Due to ambiguity arising from
bottom-up detections, the parsing algorithm needs to keep a large number of parse graphs. For
computational efficiency we prune the parse graphs at the time points corresponding to
“deciding moments”, so it is much more affordable than its counterpart in image grammar.

We propose an unsupervised learning algorithm to learn a T-AOG from video. The learning
algorithm uses a recursive block pursuit procedure to generate terminal nodes and And-nodes from
the data matrix of detected spatial relations. The ambiguity of bottom-up compositions is resolved
during the recursive block pursuit. A graph compression procedure is then used to generate the
Or-nodes of the T-AOG. The learning algorithm is guided by the information projection principle
that minimizes the total description length.

1.3.  Related  work 

Existing  methods  for  event  representation  and 
recognition  can  be  divided  into  two  categories. 

1) HMMs and DBN based methods. Brand et al. [8] modeled human actions by coupled HMMs.
Natarajan [9] described an approach based on Coupled Hidden Semi Markov Models for
recognizing human activities. Kazuhiro et al. [10] built a conversation model based on
dynamic Bayesian network. Al-Hames and Rigoll [11] presented a multi-modal mixed-state
dynamic Bayesian network for meeting event classification. Although HMMs and DBN based
algorithms achieved some success, the HMMs do not model the high order relations between
sub-events, and the fixed structure of DBN limits its power of representation.

2) Grammar based methods. Ryoo and Aggarwal [12] used the context free grammar (CFG) to
model and recognize composite human activities. Ivanov and Bobick [13] proposed a
hierarchical approach using a stochastic context free grammar (SCFG). Joo and Chellappa [14]
used probabilistic attribute grammars to recognize multi-agent activities in surveillance
settings. Zhang et al. [15] applied an extended grammar approach to modeling and recognizing
complex traffic events. These methods focus on the hierarchical structure of events, but the
temporal relations between sub-events are not fully utilized. There are other methods for
event representation and reasoning in the higher level, such as VEML and VERL [16, 17], and
PADS [18].

Compared with HMMs, the T-AOG can model higher order constraints. Compared with the
fixed-structure DBN, the T-AOG is more expressive because its Or-nodes enable the reconfiguration
of the structures. The T-AOG also represents the temporal relations between multiple sub-events by
the horizontal links between the nodes, so the resulting grammar is context-sensitive.

Most of the existing work predefines the event models manually and learns (or defines) the
parameters of the models for a predefined set of event classes. In contrast, we study an
unsupervised learning algorithm that can generate richer event classes and reduce tedious manual
labeling, and thus provides more scalability for knowledge acquisition systems. Our work is
inspired by recent progress in unsupervised learning and data mining [19, 20] as well as
grammatical learning and inference [13, 21, 15] on video data. For event grammar learning, our
strategy is most similar to Zhang et al. [15], which learns a stochastic context free grammar for
trajectory analysis of multiple agents (e.g. vehicles in street intersections). In contrast, we
adopt a richer feature representation including interactions between agents and environments. In
addition, we append a Markov model of time constraints for adjacent events, resulting in a
stochastic context sensitive grammar, which was introduced into computer vision by Zhu and Mumford
in [4]. The stochastic T-AOG provides an efficient representation for knowledge extracted from
video.

1.4. Main contributions

The  contributions  of  our  paper  are: 

1) We represent events by a T-AOG which represents the hierarchical compositions of events
and the temporal relations between the sub-events.

2) We propose an unsupervised learning algorithm to learn the T-AOG automatically from
video, based on the information projection principle.

3) Our parsing algorithm can afford to generate all possible parse graphs of single events,
combine the parse graphs to obtain the interpretation of the input video, and achieve the
global maximum-a-posteriori inference.

4) The agent’s goal and intent at each time point are inferred by a bottom-up and top-down
process based on the top-ranked parse graphs as the most probable interpretations. We show in
human experiments that our parsing algorithm can correctly infer the agent’s goals and intents
according to the video content.

5) We show that event context can be used to improve the detection result of atomic actions,
and to better segment and recognize objects in the scene. We put the event learning and
inference in the perspective of scene context, where there is a rich collection of
agent-environment interactions. By inference on the joint probability of agent and environment
events, we show how to use recognition of actions to help object recognition and scene
segmentation.

6) We collect a video data set, which includes videos of daily life captured in both indoor and
outdoor scenes, to evaluate the proposed algorithm. The events in the videos include
single-agent events, multi-agent events, and concurrent events. The results of the algorithm are
evaluated by human subjects and our experiments show satisfactory results.

This paper is an enhanced combination of our previous conference papers [22] and [23], which
focus on event parsing and grammar learning respectively. Here we integrate them into a coherent
framework. We add more experimental results to evaluate the proposed algorithm, and present new
experiments on segmenting and recognizing objects in the scene.

2.  Event  representation  by  T-AOG 

In  this  section,  we  introduce  the  T-AOG  for  event 
representation. 

T-AOG is based on interactions between agents and objects in the scene. In the videos that we
collected, there are 13 classes of objects of interest, including mug, laptop and water dispenser,
in our training and testing data. Ideally these objects would be detected automatically; however,
detection of multi-class objects in a complex scene cannot be solved perfectly by the state of the
art. Therefore, we adopt a semi-automatic object detection system. The objects in each scene are
detected by multi-class boosting with feature sharing [24], and segmented by a recent indoor scene
parsing algorithm [25]. This is not time consuming as it is done only once for each scene, and the
objects of interest are tracked automatically during the video events. Figure 1 shows the detection
result of the objects of interest in an office.

Figure 1: The detection result of objects in the office scene

2.1.  Grounded  relations  —  the  alphabet 

The T-AOG is defined on a set of unary and binary relations which can be directly detected from
video. We call these relations the grounded relations.

• A unary relation r(A) is a time varying property of the agent or object A in the scene. As
Figure 2 shows, it could be agent poses, e.g. Stand(person1) and Bend(person2), and object
states, e.g. Open(door) and Closed(door).

• A binary relation r(A, B) is the spatial relation (e.g. Touch(person1.hand, phone)) between
A and B, which could be agents, body parts (hands, feet), and objects. Figure 3 illustrates
some typical relations.

In our experiments we use video data from relatively simple scenes with few people appearing at
the same time. In this case, we can detect the spatial relations with minor ambiguity. It is beyond
the scope of this paper to study complex behaviors in crowds (e.g. [26]).

Table 1 specifies the 24 unary and binary relations in the office scene. There are four types
of relations: agent location ($r_{01} \sim r_{13}$), agent-environment interaction
($r_{14} \sim r_{17}$), agent pose ($r_{18} \sim r_{21}$) and environment event
($r_{22} \sim r_{24}$). Here we do not use the “Off” relation as shown in Figure 3 since we can
infer the status of “Off” from the status of “On”. The details of how these relations are detected
are explained in Section 3.2.

Figure 2: Some unary relations. The left part of the table shows the four unary relations as agent
poses, including ’Stand’, ’Stretch’, ’Bend’ and ’Sit’. The right part shows the two fluents (’On’
and ’Off’) of the phone and the screen of laptop.

Figure 3: Some binary relations between agents (parts) and background objects.


2.2.  Atomic  actions  —  the  terminal  nodes 

An atomic action is a vector of grounded relations $a = (r_1, \ldots, r_J)$ that happen
sequentially in the joint domain of space and time.

Figure 4 shows three atomic actions defined on the grounded relations. Table 2 shows the atomic
actions used in the office scene. These atomic actions are learned automatically from the training
data. The learning process is explained in Section 3.

Figure 4: Some atomic actions. Each atomic action is defined on a set of grounded relations shown
by 2 half circles. Unary relations ’Bend’ and ’On’ are defined in Figure 2. Binary relations ’Near’
and ’Touch’ are defined in Figure 3. For the atomic action ’ShakeHands’, when P1 is considered as
the agent, P2 is regarded as object and vice versa. See [27] for a more sophisticated system to
detect agent poses and interactions with the scene.

An atomic action is detected when all its relations are detected with probability higher than a
given threshold, and the probability of the atomic action is computed as the product of the
probabilities of all its constituent relations. An atomic action $a = (r_1, \ldots, r_J)$ has the
following probability given a short video snippet $I_{1:t}$,

$$p(a \mid I_{1:t}) = \frac{1}{Z} \prod_{j=1}^{J} p(r_j) \propto \exp\{-E(a)\} \tag{1}$$

where

$$E(a) = -\sum_{j=1}^{J} \log p(r_j)$$

is the energy of a and Z is the normalizing constant for all atomic actions. We use n = 26 learned
atomic actions shown in Table 2.
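
As a minimal sketch of Eq. (1), the following Python snippet computes the energy of an atomic action from the probabilities of its constituent relations and normalizes over the detected actions. The function names, the threshold value of 0.5, the action labels and all probabilities are illustrative assumptions, and normalizing only over the actions that pass the threshold is one possible reading of Z.

```python
import math

def atomic_action_energy(relation_probs):
    """Energy E(a) = -sum_j log p(r_j) of an atomic action, as in Eq. (1)."""
    return -sum(math.log(p) for p in relation_probs)

def detect_atomic_actions(candidates, threshold=0.5):
    """Return normalized probabilities over the detected atomic actions.

    `candidates` maps a (hypothetical) action name to the detection
    probabilities of its constituent relations. An action is detected only
    if every relation exceeds `threshold` (an assumed value); Z is taken
    here as the sum of exp(-E(a)) over the detected actions.
    """
    scores = {}
    for name, probs in candidates.items():
        if all(p > threshold for p in probs):
            scores[name] = math.exp(-atomic_action_energy(probs))
    z = sum(scores.values())
    return {name: s / z for name, s in scores.items()} if z > 0 else {}

# Made-up relation probabilities for three candidate actions.
candidates = {
    "stand_near_phone": [0.9, 0.8],
    "stand_and_use_phone": [0.9, 0.8, 0.6, 0.7],
    "sit_near_phone": [0.9, 0.3],  # one relation falls below the threshold
}
posterior = detect_atomic_actions(candidates)
```

Because the score is a product of relation probabilities, an action with more constituent relations tends to receive a lower score than a sub-action whose relations it contains; in the text this ambiguity is left to the event context to resolve.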


In our experiments, we only detect several simple agent poses (e.g. standing, sitting) as we
focus on interactions between agents and objects in the scene. In future work, we will extend our
experiments to detect a richer collection of more sophisticated agent poses using animated AND-OR
Templates [27].


Given the input video $I_\Lambda$ in a time interval $\Lambda = [0, T]$, multiple atomic actions
are detected with probabilities to account for the ambiguities in the grounded relations contained
in the atomic actions. For example, the relation ’Touch(A, B)’ cannot be clearly differentiated
from the relation ’Near(A, B)’ unless Kinect data is used. The other reason is the inaccuracy of
foreground detection. Fortunately, most of the ambiguities can be removed by the event context in
the top-down bottom-up inference; we will show this in the experiment section.

Figure 5: T-AOG for events in the office scene. S is the root node, which represents the
sequential events that happen in the office. It is a Set-node and could be any combination of K
single events. For example, S could be $E_1 \mid E_2 \mid E_1 E_2 \mid E_3 E_2 E_3 \mid \ldots$.
$E_1, \ldots, E_7$ are And-nodes representing single events. The atomic actions are also
represented by Set-nodes, and could last for 1 to n frames. The temporal relations are given by the
ratio of the lasting time between related nodes. For clarity, only the temporal relations between
sub-events are shown.


2.3.  The  T-AOG  for  events 

A T-AOG (see Figure 5 for an example) is specified by a 6-tuple

$$\text{T-AOG} = \langle S, V_N, V_T, R, E, P \rangle.$$

S is the root node for an event category, and $V_N = V^{and} \cup V^{or}$ is the set of
non-terminal nodes (events and sub-events), composed of an And-node set and an Or-node set.

Each And-node represents an event or sub-event, and is decomposed into sub-events or atomic
actions as its children nodes. The children nodes must occur in a certain temporal order.

An Or-node has a number of alternative ways to realize an event or sub-event, and each alternative
has a probability associated with it to indicate the frequency of occurrence. A Set-node is a
special Or-node which can repeat m times with probability p(m) and accounts for the time warping
effects.

$V_T$ is a set of terminal nodes for atomic actions. R is a number of relations between the nodes
(temporal relations), E is the set of all valid configurations (possible realizations of the
events) derivable from the T-AOG, i.e. its language, and P is the probability model defined on the
graph. The T-AOG for events in the office scene is shown in Figure 5. These events are learned
from the training data automatically, as illustrated in the next
section.
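
The And-Or structure above can be sketched with a few Python classes. This is a minimal illustration under stated assumptions: the class names, the `configurations` helper and the toy event are hypothetical, the atomic action labels are borrowed from Table 2, and Set-nodes and temporal relations are left out. The sketch only enumerates the language of a small graph together with the probability of each configuration.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Terminal:
    name: str  # label of an atomic action

@dataclass
class AndNode:
    name: str
    children: list  # sub-events that must occur in this temporal order

@dataclass
class OrNode:
    name: str
    children: list       # alternative realizations
    probs: List[float]   # branching probabilities, assumed to sum to 1

def configurations(node) -> List[Tuple[Tuple[str, ...], float]]:
    """Enumerate (atomic action sequence, probability) pairs derivable from node."""
    if isinstance(node, Terminal):
        return [((node.name,), 1.0)]
    if isinstance(node, OrNode):
        out = []
        for child, p in zip(node.children, node.probs):
            out.extend((seq, p * q) for seq, q in configurations(child))
        return out
    # And-node: concatenate the children's configurations in temporal order.
    out = [((), 1.0)]
    for child in node.children:
        out = [(s1 + s2, p1 * p2)
               for s1, p1 in out
               for s2, p2 in configurations(child)]
    return out

# A toy event (not one of the learned events): take the mug, get water in
# one of two ways, then use the tea box. Branching probabilities are made up.
get_water = OrNode(
    "get_water",
    [Terminal("use dispenser"),
     AndNode("refill", [Terminal("dump water"), Terminal("use dispenser")])],
    [0.7, 0.3],
)
event = AndNode("E_example", [Terminal("take mug"), get_water, Terminal("use tea box")])
language = configurations(event)
```

Each entry of `language` corresponds to one parse tree of the toy grammar; the Or-node probabilities multiply along the derivation, which is the SCFG part of the model before temporal relations are added.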


2.4. Non-parametric temporal relations

The And-nodes have already defined the temporal order of their children nodes, and the Set-nodes
representing atomic actions have modeled the lasting time of the atomic action by the frequency of
its production rules. Here we augment the T-AOG by adding temporal constraints between related
nodes.

Unlike [13] and [15], which use Allen’s 7 binary temporal relations [28], we use non-parametric
filters to model the relations between multiple nodes. We use the T-AOG of $E_1$ shown in Figure 5
to illustrate the temporal relations. $E_1$ is an And-node and A, B and C are three sub-nodes;
$t_A$, $t_B$ and $t_C$ are the lasting times of A, B and C, respectively. There is a constraint
between the lasting times of A, B and C. For example, when an agent does event $E_1$ in a hurry,
the lasting times of A, B and C will be shorter than usual, while the ratio of the lasting times
between A, B and C will remain stable. This relation r is modeled by a distribution of a function
response over the nodes included in the relation. We use $t_{E_1} = (t_A, t_B, t_C)$ to represent
the lasting time of $E_1$, and $F = (F_1, F_2, F_3)$ to represent the function on which the
response of $t_{E_1}$ is modeled. F could be regarded as a filter and
$\langle t_{E_1}, F \rangle$ could be regarded as a filter response. We use a histogram to model
the distribution of the response, and the $F^*$ on which the distribution of the training data’s
response has the minimum entropy is selected to model the relation as in [4]. Given r and $F^*$,
the probability of the relation r is

Table 1: The grounded spatial relations of T-AOG: directly detectable from video.

Name   Definition                      Description
r01    absent(agent)                   not found in the frame
r02    near(agent, other_agent)        near other agent
r03    near(agent, board)              near the white board
r04    near(agent, door)               near the door
r05    near(agent, dispenser)          near the water dispenser
r06    near(agent, trash_can)          near the trash can
r07    near(agent, mug)                near the mug
r08    near(agent, laptop)             near the laptop
r09    near(agent, phone)              near the phone
r10    near(agent, basin)              near the basin
r11    near(agent, microwave)          near the microwave
r12    near(agent, tea_box)            near the tea box
r13    in(agent, door)                 agent is in the door
r14    touch(agent, keyboard)          typing on keyboard
r15    touch(agent, mug)               grabbing the mug
r16    touch(agent, phone)             grabbing the phone
r17    touch(agent, tea_box)           grabbing the tea box
r18    bend(agent)                     bend down
r19    sit(agent)                      sitting on something
r20    raise_arm(agent)                raising arm
r21    stand(agent)                    standing straight
r22    occlude(soccer match, screen)   soccer match on the screen
r23    on(phone)                       phone is in use
r24    on(screen)                      screen is on

Table 2: Learned atomic actions.

Node Name   Semantic Name         Contained relations
a01         absent                r01
a02         arrive at door        r04, r21
a03         enter door            r04, r21, r13
a04         stand near phone      r09, r21
a05         sit near phone        r09, r19
a06         stand and use phone   r09, r21, r16, r23
a07         sit and use phone     r09, r19, r16, r23
a08         arrive at trashcan    r06, r21
a09         throw trash           r06, r18
a10         arrive at basin       r10, r21
a11         dump water            r10, r18, r15
a12         arrive at dispenser   r05, r21, r15
a13         use dispenser         r05, r18, r15
a14         arrive at tea box     r12, r21, r15
a15         use tea box           r12, r21, r15, r17
a16         arrive at board       r03, r21
a17         discussion            r03, r21, r02
a18         arrive at laptop      r08, r21
a19         sit near laptop       r08, r19
a20         watch soccer          r08, r19, r22, r24
a21         celebrate             r08, r20, r22, r24
a22         use laptop            r08, r19, r14, r24
a23         arrive at microwave   r11, r21
a24         use microwave         r11, r18
a25         arrive at mug         r07, r19
a26         take mug              r07, r19, r15

$$p(r) \sim h(\langle r, F^* \rangle) \tag{2}$$

where h is the histogram of the training data’s response over $F^*$. One may use multiple F to
model the relations if needed.

2.5. Parse graph

A parse graph is an instance of the T-AOG obtained by selecting variables at the Or-nodes and
specifying the attributes of And-nodes and terminal nodes. We use pg to denote the parse graph
of the T-AOG of a single event $E_i$. We denote the following components in pg:

• $V^t(pg) = \{a_1, \ldots, a_{n_t(pg)}\}$ is the set of leaf nodes in pg.

• $V^{or}(pg) = \{v_1, \ldots, v_{n_{or}(pg)}\}$ is the set of non-empty Or-nodes in pg;
$p(v_i)$ is the probability that $v_i$ chooses its sub-nodes in pg.

• $R(pg) = \{r_1, \ldots, r_{n(R)}\}$ is the set of temporal relations between the nodes in pg.
Without temporal relations, the pg reduces to a parse tree.

The energy of pg is defined as in [4]:

$$\mathcal{E}(pg) = \sum_{a_i \in V^t(pg)} E(a_i) + \sum_{v_i \in V^{or}(pg)} -\log p(v_i) + \sum_{r_i \in R(pg)} -\log p(r_i) \tag{3}$$

The first term is the data term. It expresses the energy of the detected terminal nodes (atomic
actions), which is computed by Eq. 1. The second term is the frequency term. It accounts for how
frequently each Or-node decomposes in a certain way, and can be learned from the training data.
The third term is the relation term, which models the temporal relations between the nodes in pg
and can be computed by Eq. 2.

Given the input video $I_\Lambda$ in a time interval $\Lambda = [0, T]$, we use PG to denote the
parse graph for a sequence of events in S that explains $I_\Lambda$. PG is of the following form,

$$PG = (K, pg_1, \ldots, pg_K)$$

where K is the number of parse graphs for events.


3. Learning the T-AOG

3.1. Information projection

The unsupervised learning of stochastic T-AOG is conducted under the information projection and
minimum description length principle [23]. Here we provide a review of the related theoretical
instruments.

Let $X^+ = \{x_1, \ldots, x_N\}$ be positive examples (e.g. observed video clips) governed by an
unknown target distribution $f(x)$. Let $X^-$ be a large set of random negative examples governed
by a reference distribution $q(x)$ (here q is an i.i.d. uniform distribution). For each example x,
a list of spatial relations

$$(r_1(x), r_2(x), \ldots, r_D(x))$$

are extracted from the video clip. These relations form a predefined alphabet, just like the set
of weak classifiers in AdaBoost. Our objective of learning is to pursue a model $p(x)$ to
approximate $f(x)$ in a series of steps:

$$q(x) = p_0(x) \to p_1(x) \to \cdots \to p_T(x) = p(x) \approx f(x)$$

starting from q.

The above model updates are performed by selecting a most informative subset from all the
spatial relations. The model p after T iterations contains T selected spatial relations
$\{r_t : t = 1, \ldots, T\}$. If the selected spatial relations capture all the related
information about the scene semantics in x, it can be shown by variable transformation [29] that:

$$\frac{p(x)}{q(x)} = \frac{p(r_1, \ldots, r_T)}{q(r_1, \ldots, r_T)}.$$

So p can be constructed by reweighting q with the marginal likelihood ratio on selected spatial
relations.

Under the maximum entropy principle, $p(x)$ can be expressed in the following log-linear form:

$$p(x) = q(x) \prod_{t=1}^{T} \left[ \frac{1}{z_t} \exp\{\beta_t r_t(x)\} \right] \tag{4}$$

where $\beta_t$ is the parameter for the t-th selected spatial relation $r_t$ and $z_t$
($z_t > 0$) is the individual normalization constant determined by $\beta_t$:

$$z_t = \sum_{r_t} q(r_t) \exp\{\beta_t r_t\}.$$

Figure 6: Pursuing homogeneous blocks from the data matrix. Each block corresponds to a terminal
node or an And-node in T-AOG. (Panels show the data matrix, with columns
$r_1, r_2, r_3, \ldots, r_K$, at steps $t = 1$, $t = 2$, ..., $t = T$.)

By the information projection principle [30, 31, 29], we adopt a step-wise procedure for
selecting spatial relations. In particular, the t-th spatial relation $r_t$ is selected and model
$p_t$ is updated by:

$$p_t = \arg\min \mathcal{K}(p_t \mid p_{t-1}) \quad \text{s.t.} \quad E_{p_t}[r_t] = \frac{1}{N} \sum_{i=1}^{N} r_t(x_i) \tag{5}$$

where $\mathcal{K}$ denotes the Kullback-Leibler divergence, and by minimizing it we select a most
informative spatial relation $r_t$ to augment $p_{t-1}$ towards $p_t$. The constraint equation in
Eq. (5) ensures that the updated model is consistent with the observed training examples on
marginal statistics. The optimal $p_t$ can be found by a simple line search or gradient
descent to satisfy the constraint in Eq. (5).
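
For a single binary relation, the constraint in Eq. (5) can be solved in closed form, which gives a compact way to see how the update works. The sketch below is illustrative only: it assumes binary responses, treats each candidate relation on its own, takes the current model probability `q1` as 0.5 (the uniform reference), and uses made-up observed frequencies. Picking the candidate with the largest divergence is also an assumption of this sketch about how "most informative" is scored. The function name is hypothetical.

```python
import math

def project_binary_relation(h, q1=0.5):
    """Solve the constraint of Eq. (5) for one binary relation.

    h  : observed frequency of the relation in the positive examples (0 < h < 1)
    q1 : probability that the relation holds under the current model (0 < q1 < 1)

    The updated model is p(r) = q(r) * exp(beta * r) / z. Requiring
    E_p[r] = h gives beta in closed form; the returned gain equals the
    KL divergence between the updated and the current model,
    h * beta - log z.
    """
    beta = math.log(h / (1.0 - h)) - math.log(q1 / (1.0 - q1))
    z = (1.0 - q1) + q1 * math.exp(beta)
    gain = h * beta - math.log(z)
    return beta, z, gain

# Made-up observed frequencies for three candidate relations.
observed = {"near(agent, door)": 0.9, "stand(agent)": 0.8, "sit(agent)": 0.5}
gains = {name: project_binary_relation(h)[2] for name, h in observed.items()}
best = max(gains, key=gains.get)  # candidate whose statistics differ most from q
```

A relation whose observed frequency matches the current model (here `sit(agent)` at 0.5) yields beta = 0 and zero gain, so adding it would not change the model.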

3.2.  Block  pursuit  on  data  matrix 

Data matrix. First we set up a data matrix R using spatial relations of positive training
examples as shown in Figure 6. Each row of R is the vector of spatial relations detected from one
example (or video clip) in $X^+$. For simplicity, we assume all positive training examples are
aligned and have the same dimensionality. Therefore R is a matrix with N (number of positive
examples) rows and D (number of all candidate spatial relations) columns, and each entry

$$R_{ij} = r_j(x_i)$$

is a binary response. $R_{ij} = 1$ means the spatial
relation j holds in example $x_i$.

Block pursuit. On the data matrix, we pursue large homogeneous blocks $\{B_k : k = 1, \ldots, K\}$. A block is specified by a set of common spatial relations (columns) that co-occur in a set of examples (rows). Each block corresponds to a frequent verb concept, i.e. a terminal node or And-node composed of several spatial relations. For example, the verb concept $a_{02}$ (arrive at the door) in Table 2 is composed of two spatial relations: near(agent, door) and stand(agent). The verb concept emerges from data because it appears frequently and with high confidence; thus it is readily represented by an And-node that strongly binds its constituent relations. Quantitatively, we can measure this by the information gain of block $B_k$, computed by the summation over the block:

$$\mathrm{Gain}(B_k) = \sum_{\substack{i \in \mathrm{rows}(B_k) \\ j \in \mathrm{cols}(B_k)}} \left( \beta_{k,j} R_{i,j} - \log z_{k,j} \right) \qquad (6)$$

where rows(·) and cols(·) denote the rows and columns of block $B_k$. cols($B_k$) correspond to the selected spatial relations, capturing their co-occurrence in space and time, and rows($B_k$) are the examples that belong to the k-th block. $\beta_{k,j}$ is the multiplicative parameter of selected spatial relation j, and $z_{k,j}$ is the individual normalizing constant determined by $\beta_{k,j}$. Eq. (6) measures the information gain by explaining the submatrix covered by $B_k$ using the foreground model p instead of the background model q. Similar approaches have also been adopted in the grammar learning of textual data [32].
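The gain in Eq. (6) can be sketched numerically as follows. This is an illustration under stated assumptions rather than the implementation used in the paper: entries of R are binary, the background frequencies `q` are supplied by the caller (the column means of R are used in the example, which is an assumption), and each $\beta_{k,j}$ is fitted in closed form so that the model matches the column mean inside the block. The function names `fit_beta` and `block_gain` are hypothetical.

```python
import numpy as np

def fit_beta(target, q, eps=1e-6):
    """Closed-form tilt for a binary relation: choose beta so that the
    tilted Bernoulli has mean `target` given background mean `q`."""
    target = np.clip(target, eps, 1 - eps)
    q = np.clip(q, eps, 1 - eps)
    beta = np.log(target * (1 - q)) - np.log(q * (1 - target))
    z = (1 - q) + q * np.exp(beta)  # z = sum_r q(r) exp(beta * r)
    return beta, z

def block_gain(R, rows, cols, q):
    """Eq. (6): sum over the block of (beta_kj * R_ij - log z_kj)."""
    sub = R[np.ix_(rows, cols)]
    beta, z = fit_beta(sub.mean(axis=0), q[cols])
    return float((sub * beta - np.log(z)).sum())

# Toy data matrix: examples 0 and 1 share relations 0 and 1.
R = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1],
              [0, 0, 0]])
q = R.mean(axis=0)  # assumed background frequencies
gain_block = block_gain(R, [0, 1], [0, 1], q)
gain_all_rows = block_gain(R, [0, 1, 2, 3], [0, 1], q)
```

Under these assumptions the gain equals the number of rows times the summed Kullback-Leibler divergence between the in-block and background Bernoulli frequencies, so a block whose statistics match the background has zero gain.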

Recall that we pursue a series of models starting from q(x) to approximate the target distribution f(x) governing the training positives $X^+$. This corresponds to maximizing the log-likelihood log p(x) on $X^+$. Initially p = q, and the data matrix has a log-likelihood $L_q(R)$. After pursuing K blocks, the resulting log-likelihood is:

$$L = L_q + \sum_{k=1}^{K} \mathrm{Gain}(B_k). \qquad (7)$$

The block pursuit algorithm is a greedy procedure that maximizes the log-likelihood in Eq. (7). Each time, we permute the rows and columns of the data matrix to pursue the block with the largest gain as computed in Eq. (6). The entries covered by the block are then explained away and excluded from subsequent block pursuit. This procedure is repeated until the information gain of the newly pursued block is negligible.

To penalize the model complexity, we apply a constant penalty for each additional block learned. This is equivalent to imposing a Laplacian prior on the size of the learned grammar.

The above block pursuit procedure can be implemented either by clustering, which produces multiple blocks or non-terminal nodes at the same time, or by stepwise pursuit, which produces one block or non-terminal node at a time.

The block pursuit procedure for T-AOG is carried out in two stages. (1) Learn a set of terminal nodes as blocks on the data matrix of grounded spatial relations. These terminal nodes account for atomic events which directly specify spatial-temporal configurations of grounded relations. This is done by clustering. (2) Learn non-terminal nodes as blocks on the data matrix of atomic actions, to account for longer events composed of atomic actions.

3.3.  Detecting  grounded  spatial  relations 

As a preprocessing step, we perform one round of bottom-up detection for grounded spatial relations.

First we use a standard background subtraction algorithm to segment the moving agent and the fluent changes of objects, and use a commercial surveillance system to track the detected agent.

The relations of the agent's location (ro rsj r 13) are detected from the distance between the agent and the objects, which is modeled by a normal distribution. The location of the agent is detected by combining foreground segmentation and skin color detection, which locates the head and hands of the agent. Then the distance between the agent and the objects is computed directly, as the locations of the objects are known (automatically detected or manually labeled).

The agent pose is inferred by a nearest neighbor classifier using both pixels and the foreground segmentation map within the estimated bounding box for the agent. An illustration of four poses using the segmented foreground mask is shown in Figure 7. The agent-environment interactions touch(agent, keyboard) and touch(agent, phone) are detected by checking whether there is enough skin color within the designated area for the laptop and phone, which are static objects in the office environment. The relations touch(agent, mug) and touch(agent, tea box) are also detected using skin color, together with the unique color and shape of the mug and tea box. When a relation involves an object, the object is tracked until the relation finishes, and the new position of the object is then updated.


The environment relation occlude(soccer match, screen) is determined by checking whether there is a large amount of green color within the designated area of the laptop. The on relations are detected by the properties of the object area, such as the intensity histogram of the bounding box.

Using the techniques described above, we detect grounded relations for every video frame. The detection result is organized as a spatial-temporal table where each row corresponds to a time frame and each column corresponds to a grounded relation.


Figure  7:  Standing,  bending,  sitting  and  raising-arm  poses. 


3.4. Learning atomic actions

We define atomic actions to be simple and transient events composed spatially and temporally by grounded relations. To learn an alphabet of atomic actions, we use a temporal scanning window spanning 5 frames to collect a large number of small clips. Each 5-frame clip is described by the detected relation vector:

$$\{(r_{1,1}, \ldots, r_{1,D}, \ldots, r_{5,1}, \ldots, r_{5,D})\}$$

where D = 24 is the number of grounded relations detected per frame. A k-means clustering is then performed on the grounded relation vectors of these 5-frame clips, using the simple Hamming distance as the metric. The centroid of a cluster is simply determined as the grounded relation vector that has minimal distance to all the cluster members. As the time span is very small, we can assume that the grounded relations (e.g. agent location, pose) stay constant during the short period. So we constrain the centroids to be stationary, i.e. $r_{1,d} = r_{2,d} = \cdots = r_{5,d}, \forall d = 1, \ldots, D$. For each cluster, we estimate the symbol probabilities $p(r_1), \ldots, p(r_{24})$ by counting the member sub-sequences of the cluster, and for brevity we represent this stochastic model by its mode (the most likely sub-sequence) as the cluster prototype. Each cluster corresponds to a block pursued in the data matrix in Figure 6.
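The clustering step can be sketched as below. This is a minimal illustration, not the authors' code: it assumes the clips are stored as a binary array of shape (clips, frames, relations), enforces the stationarity constraint by keeping one relation vector per centroid and comparing it against every frame, and uses a deterministic farthest-first initialisation that the text does not specify. The names `farthest_first_init` and `cluster_clips` are hypothetical.

```python
import numpy as np

def farthest_first_init(frames, k):
    """Pick k initial centroids: start from row 0, then repeatedly take the
    row whose Hamming distance to the already chosen rows is largest."""
    chosen = [0]
    for _ in range(1, k):
        dists = np.min([(frames != frames[c]).sum(axis=1) for c in chosen], axis=0)
        chosen.append(int(dists.argmax()))
    return frames[chosen].copy()

def cluster_clips(clips, k, n_iter=20):
    """clips: (M, F, D) binary array of M clips, F frames, D relations.
    Returns stationary centroids of shape (k, D) and labels of shape (M,)."""
    centroids = farthest_first_init(clips[:, 0, :], k)
    labels = None
    for _ in range(n_iter):
        # Hamming distance of each clip to each centroid repeated over all frames
        dist = (clips[:, None, :, :] != centroids[None, :, None, :]).sum(axis=(2, 3))
        new_labels = dist.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for c in range(k):
            members = clips[labels == c]
            if len(members):
                # a per-relation majority vote over members and frames minimises
                # the total Hamming distance to a stationary centroid
                centroids[c] = (members.mean(axis=(0, 1)) >= 0.5).astype(clips.dtype)
    return centroids, labels

# Toy example with D = 4 relations and 5-frame clips.
a = np.tile(np.array([1, 1, 0, 0]), (5, 1))
b = np.tile(np.array([0, 0, 1, 1]), (5, 1))
clips = np.stack([a, a, a, b, b, b])
clips[1, 2, 3] = 1  # one noisy detection in one frame
centroids, labels = cluster_clips(clips, k=2)
```

The majority vote plays the role of the "mode" prototype described above: a single noisy detection does not change the centroid of its cluster.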

The result of clustering is a list of 26 atomic actions, shown with their semantic descriptions in Table 2. Each atomic action is represented by a list of grounded relations that are activated. The atomic actions that happen most frequently include $a_{19}$ (sit near laptop), $a_{22}$ (use laptop), $a_{20}$ (watch soccer) and $a_{03}$ (enter door). $a_{19}$ and $a_{22}$ can be considered as constituent components of a longer event “working by laptop”. $a_{03}$ indicates the student is entering or leaving. The learned atomic actions and their relative frequencies are representative and truthful to the video data.

Figure 8: The duration model for the length of repetition. Panels: walking in passage; standing at water; standing at trashcan; sitting at desk; sitting at desk, typing; standing at door.

Now the sequence of multi-dimensional relations is encoded by the alphabet of 26 atomic actions. For computational efficiency in discovering longer events, we use hard assignments by computing the most likely atomic action for every 5 frames. The resulting sequence of atomic actions is

$$w_{1:T} = (w_1, \ldots, w_T), \quad \text{where } w_t \in \{a_{01}, \ldots, a_{26}\}$$

and T is the total number of video frames divided by 5.

3.5.  Learning  longer  events  and  T-AOG 

There is large variation in the duration of atomic actions. For example, a student may repeatedly enter the office, work for a varying time and leave the office. If we naively group atomic actions into longer ones, we get a large number of repetitive patterns of various lengths, providing little information. To deal with temporal variation, we perform a simple compression operation: every repetitive subsequence is summarized into one symbol (e.g. bbbb is substituted by b). We may interpret this operation as learning a large number of grammar rules of the form N → NN...N with various lengths of repetition. We estimate a non-parametric model (Figure 8) for the length of repetition, or duration, under the maximum likelihood principle.
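The compression operation and the duration model can be illustrated with a short sketch. It assumes that the non-parametric maximum likelihood estimate of the duration is simply the normalised histogram of observed run lengths; the function names `compress_runs` and `duration_model` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import groupby

def compress_runs(seq):
    """Collapse each run of a repeated symbol into one symbol and record
    the run lengths per symbol (e.g. 'bbbb' becomes 'b' with length 4)."""
    compressed, run_lengths = [], defaultdict(list)
    for symbol, run in groupby(seq):
        compressed.append(symbol)
        run_lengths[symbol].append(len(list(run)))
    return compressed, dict(run_lengths)

def duration_model(lengths):
    """Empirical distribution over run lengths (histogram normalised to 1)."""
    counts = Counter(lengths)
    total = len(lengths)
    return {length: c / total for length, c in sorted(counts.items())}

compressed, runs = compress_runs("aabbbbab")
model_a = duration_model(runs["a"])
```

Keeping the run lengths alongside the compressed sequence means that no timing information is lost: the duration model can later be used to score how plausible an observed repetition length is.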

After compression, the original sequence of atomic actions is transformed into a much shorter one $c_{1:M}$ ($M \ll T$), where each symbol $c_i$ takes its value from the same domain as $w_t$.

There will be some frames in which none of the relations is activated except $r_{21}$; that is, in these frames the agent just stands somewhere that is not near any object of interest. These frames are regarded as background frames, during which no event or action of interest happens. The background frames and the frames in which absent is detected are used to separate the video into different sequences, each of which is a single event. One example sequence (event) is $a_{01}, a_{02}, a_{03}$. It is the entering event, which is composed of absent, arrive at door and enter door. Another example is $a_{25}, a_{26}, a_{14}, a_{15}, a_{12}, a_{13}$. It is the taking water event, which is composed of arrive at mug, take mug, arrive at tea box, use tea box, arrive at dispenser and use dispenser. These sequences are used to learn the grammar.

We then scan the sequence $c_{1:M}$ to collect subsequences of length l (l = 2 in our system) and form a data matrix. Now the columns of this data matrix are atomic actions instead of grounded relations. A large number of homogeneous blocks (i.e. frequent sub-sequences) are identified from the data matrix. They are candidates for the right hand side of production rules in the event grammar. From the candidates, we select a subset of production rules in a stepwise fashion.

The proposed candidate production rule takes the form $\alpha \rightarrow \beta\gamma$. It re-encodes the current sequence into a new sequence by replacing all occurrences of $\beta\gamma$ by $\alpha$. By doing this, the reduction in description length is computed as:

$$\text{reduction} = \Delta_1 + \Delta_2 + \Delta_3 - \text{constant} \qquad (8)$$

and,

$$\Delta_1 = n'_\alpha \left( \log\frac{n'_\alpha}{n'} - \log\frac{n_\beta}{n} - \log\frac{n_\gamma}{n} \right)$$

$$\Delta_2 = n'_\beta \left( \log\frac{n'_\beta}{n'} - \log\frac{n_\beta}{n} \right) + n'_\gamma \left( \log\frac{n'_\gamma}{n'} - \log\frac{n_\gamma}{n} \right)$$

$$\Delta_3 = (n' - n'_\alpha - n'_\beta - n'_\gamma) \cdot \log\frac{n}{n'}$$

where $n'_\alpha, n'_\beta, n'_\gamma$ are the frequencies of $\alpha, \beta, \gamma$ in the new sequence respectively, $n_\beta, n_\gamma$ are the corresponding frequencies in the current sequence, n is the length of the current sequence, and $n' = n - n'_\alpha$ is the length of the new sequence. We rank the candidate production rules using Eq. (8) and select the largest one. This learning procedure is recursively carried out, until the reduction of description length is too small for any new candidate production rule.

Table 3: Learned production rules of T-AOG. For simplicity, we omit the starting symbol S and the branching probabilities that S produces the following non-terminal nodes.

| Production rule | Semantic |
|---|---|
| $N_{01} \rightarrow a_{01} a_{02} a_{03} a_{02}$ | absent, arrive at door, enter door, arrive at door |
| $N_{02} \rightarrow a_{02} a_{03} a_{02} a_{01}$ | arrive at door, enter door, arrive at door, absent |
| $N_{03} \rightarrow a_{04} a_{06}$ | stand near phone, stand and phone |
| $N_{04} \rightarrow a_{05} a_{07}$ | sit near phone, sit and use phone |
| $N_{05} \rightarrow a_{25} a_{26}$ | arrive at mug, take mug |
| $N_{06} \rightarrow a_{10} a_{11}$ | arrive at basin, dump water |
| $N_{07} \rightarrow a_{14} a_{15}$ | arrive at tea box, use tea box |
| $N_{08} \rightarrow a_{12} a_{13}$ | arrive at dispenser, use dispenser |
| $N_{09} \rightarrow a_{26} a_{25}$ | take mug, arrive at mug |
| $N_{10} \rightarrow N_{05} N_{06} N_{07} N_{08} N_{09}$ | take mug, dump water, make tea, take water, take mug |
| $N_{11} \rightarrow N_{05} N_{07} N_{08} N_{09}$ | take mug, make tea, take water, take mug |
| $N_{12} \rightarrow N_{05} N_{08} N_{09}$ | take mug, take water, take mug |
| $N_{13} \rightarrow N_{05} N_{06} N_{08} N_{09}$ | take mug, dump water, take water, take mug |
| $N_{14} \rightarrow a_{18} a_{19} a_{22}$ | arrive at laptop, sit near laptop, use laptop |
| $N_{15} \rightarrow N_{14} a_{19}$ | use laptop, sit near laptop |
| $N_{16} \rightarrow a_{20} a_{21}$ | watch soccer, celebrate |
| $N_{17} \rightarrow N_{14} N_{16} a_{19}$ | use laptop, watch soccer, sit near laptop |
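The rule selection step can be sketched in a few lines. The sketch rests on an assumption that the text does not spell out: the description length of a sequence is taken to be its negative log-likelihood under the empirical symbol frequencies, in which case the three terms of Eq. (8) add up exactly to the difference between the old and new code lengths. It also assumes that $\beta \neq \gamma$ (which holds after run compression) and that the new symbol does not already occur in the sequence. All function names are hypothetical.

```python
import math
from collections import Counter

def code_length(seq):
    """Description length (in nats) under the empirical symbol frequencies."""
    n = len(seq)
    return -sum(c * math.log(c / n) for c in Counter(seq).values())

def apply_rule(seq, beta, gamma, alpha):
    """Replace every non-overlapping occurrence of (beta, gamma) by alpha."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == beta and seq[i + 1] == gamma:
            out.append(alpha)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def xlog(count, prob):
    """count * log(prob), with the convention 0 * log(0) = 0."""
    return count * math.log(prob) if count > 0 else 0.0

def reduction(seq, beta, gamma, alpha="ALPHA", constant=0.0):
    """Delta_1 + Delta_2 + Delta_3 - constant for the rule alpha -> beta gamma."""
    new = apply_rule(seq, beta, gamma, alpha)
    n, n_new = len(seq), len(new)
    old, cur = Counter(seq), Counter(new)
    na, nb, ng = cur[alpha], cur[beta], cur[gamma]
    if na == 0:  # the rule never fires, so nothing is gained
        return -constant
    log_b, log_g = math.log(old[beta] / n), math.log(old[gamma] / n)
    d1 = na * (math.log(na / n_new) - log_b - log_g)
    d2 = (xlog(nb, nb / n_new) - nb * log_b) + (xlog(ng, ng / n_new) - ng * log_g)
    d3 = (n_new - na - nb - ng) * math.log(n / n_new)
    return d1 + d2 + d3 - constant

def best_rule(seq):
    """Rank all adjacent pairs by their reduction and return the best one."""
    candidates = {(b, g) for b, g in zip(seq, seq[1:]) if b != g}
    return max(candidates, key=lambda bg: reduction(seq, *bg))

seq = list("abxabyabzab")
best = best_rule(seq)
```

In this toy sequence the pair (a, b) occurs four times and every other pair once, so it is the rule that shortens the encoding the most. The `constant` argument stands for the fixed cost of storing one more rule.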

As a result, we obtain a dictionary of new production rules shown in Table 3, where to make the grammar more compact we merge shorter production rules into longer ones that maximally reduce the description length.

We can see from the table that $N_{10} \sim N_{13}$ are taking water events; we can cluster them by the object involved in them, the mug. Similarly, $N_{15}$ and $N_{17}$ are clustered by the laptop. Then, we can align them to learn the Or-node. We introduce a special event (action) “NULL” to represent an event or action that is not of interest. We put the NULL event in the aligned sequences as shown in Figure 9, and by combining the production rules (e.g. $N_4\,\mathrm{NULL} \cup N_4 N_5 \rightarrow N_4 (\mathrm{NULL} \cup N_5)$) we get a stochastic T-AOG for each clustered event. The T-AOG of the take water event is illustrated in Figure 10, where for brevity we only show the graph structure and omit the branching probabilities of Or-nodes. Here an And-node represents that an event is decomposed into sub-events or atomic actions; an Or-node represents alternative ways to realize an event. The T-AOG presents a large amount of node sharing in the compositional hierarchy.
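The alignment with NULL can be sketched as follows, using the four taking water rules of Table 3. The sketch assumes that the longest rule serves as the alignment template and that every other rule is a subsequence of it; the text does not describe the alignment procedure in this detail, and the names `align_to_template` and `or_nodes` are hypothetical.

```python
def align_to_template(rule, template, null="NULL"):
    """Align `rule` (a subsequence of `template`) slot by slot, inserting
    NULL wherever a template slot is skipped."""
    aligned, i = [], 0
    for slot in template:
        if i < len(rule) and rule[i] == slot:
            aligned.append(slot)
            i += 1
        else:
            aligned.append(null)
    if i != len(rule):
        raise ValueError("rule is not a subsequence of the template")
    return aligned

def or_nodes(rules, template):
    """For each template slot, collect the alternatives seen across rules."""
    aligned = [align_to_template(rule, template) for rule in rules]
    return [sorted(set(column)) for column in zip(*aligned)]

# Right-hand sides of N10 to N13 from Table 3.
n10 = ["N05", "N06", "N07", "N08", "N09"]
n11 = ["N05", "N07", "N08", "N09"]
n12 = ["N05", "N08", "N09"]
n13 = ["N05", "N06", "N08", "N09"]
slots = or_nodes([n10, n11, n12, n13], template=n10)
```

Slots with a single alternative stay plain children of the And-node, while slots with two alternatives, such as N06 or NULL, become Or-nodes whose branching probabilities are estimated later.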

The terminal nodes $\{a_1, a_2, \ldots\}$ and non-terminal And-nodes form a compositional hierarchy. By learning them altogether, we greatly reduce the ambiguity of segmenting video into events and atomic actions.


3.6.  Learning  the  parameters  of  T-AOG 

After the structure (i.e. And-Or nodes) of T-AOG is learned, we can compute the probability of each branch of an Or-node by counting the number of times each branch appears. This is essentially a maximum likelihood estimation. The details can be found in [4]. Let $V_i^{or}$ be an Or-node and v be an index of one of $V_i^{or}$'s branches, then

$$p(V_i^{or} = v) = \frac{\sum_{pg \in PG} \mathbf{1}\{V_i(pg) = v\}}{|PG|}$$

where PG is the set of all parse graphs on the training data.
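The counting estimate can be written as a short sketch. It assumes that each training parse graph is summarised as a mapping from Or-node names to the branch chosen in that parse graph, and that the count is normalised by the total number of parse graphs; both the data layout and the name `branch_probabilities` are hypothetical.

```python
from collections import Counter

def branch_probabilities(parse_graphs, or_node):
    """Relative frequency of each branch of `or_node` over all parse graphs."""
    counts = Counter(pg[or_node] for pg in parse_graphs if or_node in pg)
    return {branch: c / len(parse_graphs) for branch, c in counts.items()}

training = [{"V1": "N06"}, {"V1": "NULL"}, {"V1": "NULL"}, {"V1": "NULL"}]
probs = branch_probabilities(training, "V1")
```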
Figure 9: The aligned rules of fetching water.

Figure 10: The learned T-AOG of fetching water.


4. Event parsing with Goal inference and Intent Prediction

In  this  section,  we  first  show  the  event  parsing 
process  by  assuming  that  there  is  only  one  agent 
in  the  scene  in  Section  4.1  -  4.3.  In  Section  4.4  we 
show  how  to  parse  events  when  there  are  multiple 
agents  in  the  scene. 

4.1. Formulation of event parsing

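The input of our algorithm is a video $I_\Lambda$ in a time interval $\Lambda = [0, T]$, and atomic actions are detected at every frame $I_t$. We denote by $\Lambda_{pg_i}$ the time explained by parse graph $pg_i$. $PG = (K, pg_1, \ldots, pg_K)$ is regarded as an interpretation of $I_\Lambda$ where

$$\begin{cases} \bigcup_{i=1}^{K} \Lambda_{pg_i} = \Lambda \\ \Lambda_{pg_i} \cap \Lambda_{pg_j} = \emptyset \quad \forall i, j,\ i \neq j \end{cases}$$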

We use a small T-AOG in Figure 11(a) to illustrate the algorithm. Figure 11(b) shows a sample input of atomic actions. Note that there are multiple atomic actions at each time point. Figure 11(c), (d) and (e) show three possible parse graphs (interpretations) of the input up to time $t_4$. $PG_1 = (1, pg_1)$ in Figure 11(c) is an interpretation of the video $I_{[t_1,t_4]}$; it segments $I_{[t_1,t_4]}$ into one single event $E_1$ at the event level, and into three atomic actions $a_1$, $a_3$ and $a_4$ at the atomic action level. $PG_2 = (2, pg_2, pg_3)$ in Figure 11(d) segments $I_{[t_1,t_4]}$ into two single events $E_1$ and $E_2$, where $E_2$ is inserted in the process of $E_1$. Similarly, $PG_3 = (2, pg_4, pg_5)$ in Figure 11(e) is another parse graph and segments $I_{[t_1,t_4]}$ into two single events $E_1$ and $E_2$.

Figure 11: (a) A small T-AOG. (b) A typical input of the algorithm. (c), (d) and (e) are three possible parse graphs (interpretations) of the video $I_{[t_1,t_4]}$. Each interpretation segments the video $I_{[t_1,t_4]}$ into single events at the event level and into atomic actions at the atomic action level.

We can see that the segmentation of events is automatically integrated in the parsing process: each interpretation segments the video $I_\Lambda$ into single events and removes the ambiguities in the detection of atomic actions by the event context. The energy of PG is

$$E(PG \mid I_\Lambda) = p(K) \sum_{k=1}^{K} \left( \epsilon(pg_k \mid I_{\Lambda_{pg_k}}) - \log p(k) \right) \qquad (10)$$

where p(k) is the prior probability of the single event whose parse graph in PG is $pg_k$, and p(K) is a penalty term that follows the Poisson distribution as $p(K) = \frac{\lambda_T^K e^{-\lambda_T}}{K!}$, where $\lambda_T$ is the expected number of parse graphs in $I_\Lambda$. The probability for PG is of the following form

$$p(PG \mid I_\Lambda) = \frac{1}{Z} \exp\{-E(PG \mid I_\Lambda)\} \qquad (11)$$

where Z is the normalization factor, summed over all PG as $Z = \sum_{PG} \exp\{-E(PG \mid I_\Lambda)\}$. The most likely interpretation of $I_\Lambda$ can be found by maximizing the following posterior probability

$$PG^* = \arg\max_{PG} p(PG \mid I_\Lambda) \qquad (12)$$
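Given the energies of a set of candidate interpretations, Eqs. (11) and (12) amount to a softmax over negative energies followed by an argmax. The sketch below assumes that the candidates have already been enumerated and that Z is computed over exactly those candidates; the energies in the example are made up for illustration, and the function names are hypothetical.

```python
import math

def posterior(energies):
    """Map {interpretation: energy} to {interpretation: probability}, Eq. (11)."""
    e_min = min(energies.values())  # shift energies for numerical stability
    weights = {name: math.exp(-(e - e_min)) for name, e in energies.items()}
    z = sum(weights.values())
    return {name: w / z for name, w in weights.items()}

def best_interpretation(energies):
    """Eq. (12): the interpretation with the highest posterior probability."""
    post = posterior(energies)
    return max(post, key=post.get)

energies = {"PG1": 1.0, "PG2": 2.0, "PG3": 3.0}  # illustrative values only
post = posterior(energies)
best = best_interpretation(energies)
```

Subtracting the smallest energy before exponentiating leaves the probabilities unchanged but avoids underflow when energies are large.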
Figure 12: (a) The two T-AOGs of single event $E_1$ and $E_2$. (b) The input in the worst case. (c) The parse graphs at time $t_1$. (d) The parse graphs at time $t_2$.


When the most probable interpretation is obtained, the goal at frame $I_t$ can be inferred as the single event whose parse graph $pg_i$ explains $I_t$, and the intent can be predicted by the parse graph $pg_i$.

4.2.  Generating  parse  graphs  of  single  events 

We implemented an online parsing algorithm for T-AOG based on Earley's parser [6] to generate parse graphs based on the input data. Earley's algorithm reads terminal symbols sequentially, creating a set of all pending derivations (states) that are consistent with the input up to the current input terminal symbol. Given the next input symbol, the parsing algorithm iteratively performs one of three basic operations (prediction, scanning and completion) for each state in the current state set.

For clarity, we use two simple T-AOGs of $E_1$ and $E_2$ without set nodes, as shown in Figure 12(a), to show the parsing process. Here we consider the worst case, that is, at each time the input contains all the atomic actions in $E_1$ and $E_2$, as shown in Figure 12(b). At time $t_0$, in the prediction step, $E_1$'s first atomic action $a_1$ and $E_2$'s first atomic action $a_4$ are put in the open list. At time $t_1$, in the scanning step, since $a_1$ and $a_4$ are in the input, they are scanned in and there are two partial parse graphs at $t_1$, as shown in Figure 12(c). Notice that we do not remove $a_1$ and $a_4$ from the open list. This is because the input is ambiguous: if the input at $t_1$ is really $a_1$, then it cannot be $a_4$, so $a_4$ should not be scanned in and should stay in the open list waiting for the next input. The same holds if the input at $t_1$ is really $a_4$. Then, based on the parse graphs, $a_2$, $a_3$ and $a_5$ are predicted and put in the open list, so at time $t_1$ we have $a_1, a_2, a_3, a_4, a_5$ in the open list. At time $t_2$, all of the five nodes in the open list are scanned in and we have 7 parse graphs (five new parse graphs plus the two parse graphs at $t_1$), as shown in Figure 12(d). The two parse graphs at $t_1$ are kept unchanged at $t_2$ to preserve the ambiguities in the input. This process continues iteratively and all the possible parse graphs of $E_1$ and $E_2$ are generated.
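The growth of the set of partial parse graphs in this worst case can be reproduced with a small sketch. It is a deliberately simplified illustration and not a full Earley parser: each event is given as a list of alternative action sequences, and the grammar below (E1 is a1 followed by a2 or a3, E2 is a4 followed by a5) is a guess that is merely consistent with the counts quoted above, since Figure 12(a) is not reproduced here. The names `GRAMMAR` and `step` are hypothetical.

```python
# Assumed toy grammar: event -> alternative sequences of atomic actions.
GRAMMAR = {"E1": [("a1", "a2"), ("a1", "a3")], "E2": [("a4", "a5")]}

def step(partials, observed, grammar, t):
    """One time step. Existing partial parses are kept unchanged (to preserve
    ambiguity); new ones are added for every predicted action that is observed."""
    new = set()
    # Event starts stay in the open list, so a new parse may begin at any time.
    for event, alternatives in grammar.items():
        for alt in alternatives:
            if alt[0] in observed:
                new.add((event, ((alt[0], t),)))
    # Scan the next predicted action of every existing partial parse.
    for event, scanned in partials:
        prefix = tuple(action for action, _ in scanned)
        for alt in grammar[event]:
            if alt[:len(prefix)] == prefix and len(alt) > len(prefix):
                nxt = alt[len(prefix)]
                if nxt in observed:
                    new.add((event, scanned + ((nxt, t),)))
    return partials | new

worst_case_input = {"a1", "a2", "a3", "a4", "a5"}
at_t1 = step(set(), worst_case_input, GRAMMAR, t=1)
at_t2 = step(at_t1, worst_case_input, GRAMMAR, t=2)
```

Under this assumed grammar the sketch yields two partial parse graphs at the first step and seven at the second (the two old ones, two fresh starts and three extensions), matching the counts in the walkthrough.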

4.3.  Run-time  incremental  event  parsing 

As time passes, the number of parse graphs will increase rapidly and the number of the possible interpretations of the input will become huge, as Figure 13(a) shows. However, the number of acceptable interpretations (PG with probability higher than a given threshold) does not keep increasing; it fluctuates and drops sharply at certain times, as shown in Figure 13(b). We call these time points the “decision moments”. This resembles human cognition. When people watch others taking some actions, the number of possible events could be huge, but at certain times, when some critical actions occur, most of the alternative interpretations can be ruled out.

Our parsing algorithm behaves in a similar way. At each frame, we compute the probabilities of all the possible interpretations and only the acceptable interpretations are kept. The parse graphs which are not contained in any of these acceptable interpretations are pruned. This greatly reduces the complexity of the proposed algorithm.
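The pruning rule can be sketched as follows. The sketch assumes that each interpretation is stored with its probability and the identifiers of the parse graphs it contains; the probabilities, the threshold and the name `prune` are illustrative only.

```python
def prune(interpretations, threshold):
    """Keep interpretations whose probability exceeds `threshold` and return
    them together with the set of parse graphs that they still use."""
    kept = {name: (p, pgs) for name, (p, pgs) in interpretations.items()
            if p > threshold}
    alive = set()
    for _, pgs in kept.values():
        alive |= set(pgs)
    return kept, alive

interpretations = {
    "PG1": (0.60, {"pg1"}),
    "PG2": (0.35, {"pg2", "pg3"}),
    "PG3": (0.05, {"pg4", "pg5"}),
}
kept, alive = prune(interpretations, threshold=0.1)
```

Parse graphs that appear only in rejected interpretations (here pg4 and pg5) are dropped, so they are no longer extended at later frames.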

4.4. Multi-agent Event parsing

When  there  are  multiple  agents  in  the  scene, 
we  can  do  event  parsing  for  each  agent  separately. 
That  is,  for  each  agent  in  the  scene,  the  atomic 
actions  are  detected  (all  other  agents  are  regarded 
as  objects  in  the  scene)  and  parsed  as  mentioned 
above,  then  the  interpretations  of  all  the  agents  in 
the  scene  are  obtained. 

5.  Experiments 

5.1.  Data  set 

For evaluation, we collect videos in 5 indoor and outdoor scenes, including office, lab, hallway, corridor and near vending machines. Figure 14 shows some screenshots of the videos. The training video lasts 60 minutes in total, and contains 34 types of atomic actions (26 of the 34 types are listed in Table 2 for the office scene) and 12 event categories. Each event happens 3 to 10 times.

Figure 13: (a) The number of possible interpretations (in logarithm) vs time (in seconds). (b) The number of acceptable interpretations vs time. The decision moments are the time points at which the critical actions happen and the number of acceptable interpretations drops sharply.

Figure 14: Some screenshots of the data. (a) Lab (b) Hallway (c) Corridor (d) Outdoor

The structures of the T-AOG are learned automatically from the training data as described in Section 3; the parameters and temporal relations are also learned from the training data. The testing video lasts 50 minutes and contains 12 event categories, including single-agent events like getting water and using a microwave, and multi-agent events like discussing at the white board and exchanging objects. The testing video also includes event insertion such as making a call while getting water.

5.2.  Event  recognition 

The performance of event recognition is shown in Table 4. Figure 15 shows the recognition results of events which may involve multiple agents and happen concurrently.

Using the learned T-AOG, we parse the sequence of atomic actions extracted from a long video in Figure 16. The sequence is already compressed so that repeating subsequences are suppressed into single symbols. In the zoomed-out parts of the parse graph in Figure 16, we also show the detected bounding boxes of the agent. The semantic description for different non-terminal nodes is also illustrated.

5.3.  Goal  inference  and  intent  prediction 

Besides the classification rate, we also evaluate the precision of the goal inference and intent prediction online. We compare the result of the proposed algorithm with 5 human subjects, as was done in the cognitive study with toy examples in a maze world in [2]. The participants viewed the videos with several judgement points; at each judgement point, the participants were asked to infer the goal of the agent and predict his next action with probability.


Table 4: Recognition accuracy of our algorithm.

| Scene | Number of event instances | Correct | Accuracy |
|---|---|---|---|
| Office | 32 | 29 | 0.906 |
| Lab | 12 | 12 | 1.000 |
| Hallway | 23 | 23 | 1.000 |
| Corridor | 9 | 8 | 0.888 |
| Outdoor | 11 | 11 | 1.000 |

Figure 17 (a) shows five judgement points of an event insertion (making a call in the process of getting water). Figure 17 (b) shows the experimental results of event segmentation and insertion. Figure 17 (c) shows the goal inference results obtained by the participants and by our algorithm respectively, and Figure 17 (d) shows the intent prediction results. Our algorithm can predict one or multiple steps according to the parse graph. Here we only show the result of predicting one step. Although the probabilities of the goal inference and intent prediction results are not the same as the average of the participants, the final classifications are the same. In the testing video, we set 30 judgement points in the middle of events. The accuracy of goal inference is 90% and the accuracy of intent prediction is 87%.

5.4. Atomic action recognition with event context

Due to the ambiguity of bottom-up detection, the sequence of detected atomic actions is noisy and prone to error. We propose to use the learned T-AOG to “de-noise” the atomic action sequence. With the learned spatial and temporal grammars as the prior, the detection of atomic actions follows a Bayesian maximum-a-posteriori estimate:

$$a^* = \arg\max_a\, p(r \mid a; \Theta)\, p(a; \mathcal{G})$$

where r is the sequence of grounded relations in the video. It is more robust than merely using bottom-up proposals:

$$a^{\text{bottom-up}} = \arg\max_a\, p(r \mid a; \Theta)$$

where $\mathcal{G}$ is the learned grammar, and $\Theta$ are the parameters of the bottom-up detectors of atomic actions. We perform an experiment on a collection of 12061 frames.

Figure 15: Experiment results of event parsing for multiple agents. Agent P1 works during frames 4861 to 6196, agent P2 enters the room from frames 6000 to 6196, then they go to the white board, have a discussion and leave the board. The semantic meanings of the atomic actions are shown in Table 2.

Figure 16: Video parsing result.

Figure 18 shows the ROC curve of the recognition results of all the atomic actions in the testing data. The ROC is computed by changing the threshold used in the detection of atomic actions. From the ROC curve we can see that with event context, the recognition rate of atomic actions is improved greatly.

5.5.  Scene  labeling  using  events 

In the previous sections, the learning and parsing of T-AOG rely on the detection or manual labeling of objects in the scene. Now we try to relax this requirement of manual labeling, and use the T-AOG to infer scene semantics automatically, thus closing the loop of unsupervised learning.

Our objective is to label the scene image, especially objects involved in the event parsing for a video $I_\Lambda$ in a time interval $\Lambda = [0, T]$. For example, a drinking action indicates the location of a cup, while a sitting function indicates a chair.

Suppose pg is an event parse graph from $I_\Lambda$, and

$$R(pg) = \{r_1, \ldots, r_N\}$$

is the set of relations in pg for contextual actions involving the interactions between the agent at observed position $x_i^{agt}$ and an object at unknown position $x_i^l$, where l is the object label: $l \in \Omega_L = \{\text{'desk'}, \text{'chair'}, \text{'cup'}, \ldots\}$. Thus $r_i = r_i(x_i^{agt}, x_i^l)$. This can be easily extended to multi-way relations. We denote by $L = \{L(x), L(x) \in \Omega_L, \forall x \in \text{image lattice}\}$ the scene label.

Figure 17: Experiment results of event segmentation, insertion, goal inference and intent prediction. (a) Five judgement points of the sequential event of getting water and making a call. The semantic meanings of the atomic actions in (d) are shown in Table 2.

Figure 18: The ROC curve of recognition results of atomic actions (recognition rate vs false positive rate).

Figure 19: Scene labeling by parsed trajectories. (a) The trajectories of the agent's hands and feet. (b) The segmentation of objects by the trajectory “scribbles”. (c) The segmentation of adjacent areas of 4 and 5. (d) The final labeling result for interesting objects.

The scene labeling problem is then formulated as a joint inference,

$$(L^*, pg^*) = \arg\max_{L,\, pg} \; p(L, pg \mid I_\Lambda),$$

where we denote $pg = (R, pg_{-})$, $R$ is the alphabet of the T-AOG and $pg_{-}$ is the hierarchical parse graph. Or,

$$
\begin{aligned}
(L^*, R^*, pg_{-}^*) &= \arg\max \; p(L, R, pg_{-} \mid I_\Lambda) \\
&= \arg\min \; E(L, R) + E(R, pg_{-}) + E(L) + E(pg_{-}),
\end{aligned}
$$

where $E(L)$ is a smoothness prior on $L$. $E(L, R)$ is the energy term involving the set of relations $R$ and the object label $L$,

$$E(L, R) = \sum_{i=1}^{N} K(x_i^l - x_i^{agt}),$$

with $K(\cdot)$ being a distance function, and $x_i^l$ being the point that satisfies two conditions:

1. $x_i^l \in \Omega_l = \{x : L(x) = l\}$;

2. It is close to $x_i^{agt}$.

So

$$x_i^l = \arg\min_{x_i \in \Omega_l} K(x_i - x_i^{agt}).$$
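The relation energy above can be illustrated with a small sketch. It assumes, purely for illustration, that $K$ is the Euclidean distance and that the scene label is stored as a 2D grid of label strings; the names `nearest_labeled_point` and `relation_energy` and the toy data are hypothetical.

```python
import math


def nearest_labeled_point(agent_xy, label_map, label):
    """Return the pixel (x, y) carrying `label` that is closest to agent_xy,
    together with its distance. Returns (None, inf) if the label is absent."""
    best, best_d = None, math.inf
    for row, line in enumerate(label_map):
        for col, value in enumerate(line):
            if value != label:
                continue
            # Assumption: K is the Euclidean distance between pixel positions.
            d = math.hypot(col - agent_xy[0], row - agent_xy[1])
            if d < best_d:
                best, best_d = (col, row), d
    return best, best_d


def relation_energy(relations, label_map):
    """Sum of K(x_i^l - x_i^agt) over all relations (agent position, label)."""
    total = 0.0
    for agent_xy, label in relations:
        _, d = nearest_labeled_point(agent_xy, label_map, label)
        total += d
    return total


# Toy 3x4 scene label and two relations: a hand near a cup, a body near a chair.
label_map = [
    ["floor", "floor", "cup", "desk"],
    ["floor", "floor", "desk", "desk"],
    ["chair", "floor", "floor", "floor"],
]
relations = [((1, 0), "cup"), ((0, 1), "chair")]
print(relation_energy(relations, label_map))  # 2.0
```

A labeling that places the cup and the chair far from where the agent interacted with them would receive a larger energy, which is what drives the labels toward the observed interaction points.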

Thus the labeling component given $pg$ or $R(pg)$ is

$$L^* \mid pg = \arg\min_{L} \; E(L, R) + E(L).$$

The first term is similar to the “user scribbles” in interactive segmentation and labeling [33]. Each label $l \in \Omega_L$ has a set of scribble points $\{x_j^l : j = 1, \ldots, n_l\}$, where $\sum_{l \in \Omega_L} n_l = N$. The second term uses the “scribbles” to label the whole scene based on image properties and smoothness assumptions. In Figure 19 we show an example of applying the above scene labeling inference procedure.
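To make the role of the two terms concrete, the sketch below propagates scribble labels over a tiny grayscale image. It is only an illustration under stated assumptions and not the segmentation method of [33]: the data term is taken to be the squared difference to the mean intensity under each label's scribbles, the smoothness prior is a Potts penalty on 4-neighbours, and the minimization uses iterated conditional modes. The function name `label_scene`, the weight `beta` and the toy image are all hypothetical.

```python
def label_scene(image, scribbles, beta=0.1, sweeps=5):
    """Label every pixel given scribble points {label: [(x, y), ...]}."""
    rows, cols = len(image), len(image[0])
    labels = list(scribbles)
    # Assumed appearance model: mean intensity under each label's scribbles.
    mean = {l: sum(image[r][c] for c, r in pts) / len(pts)
            for l, pts in scribbles.items()}
    # Scribbled pixels keep their label throughout.
    fixed = {(c, r): l for l, pts in scribbles.items() for c, r in pts}

    def unary(r, c, l):
        return (image[r][c] - mean[l]) ** 2

    # Initialise from the data term alone, then clamp the scribbles.
    out = [[min(labels, key=lambda l: unary(r, c, l)) for c in range(cols)]
           for r in range(rows)]
    for (c, r), l in fixed.items():
        out[r][c] = l

    # Iterated conditional modes: greedily lower data + Potts smoothness cost.
    for _ in range(sweeps):
        for r in range(rows):
            for c in range(cols):
                if (c, r) in fixed:
                    continue
                neighbours = [out[rr][cc]
                              for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                              if 0 <= rr < rows and 0 <= cc < cols]

                def cost(l):
                    return unary(r, c, l) + beta * sum(n != l for n in neighbours)

                out[r][c] = min(labels, key=cost)
    return out


# Toy image: dark ground on the left, bright desk on the right, one noisy pixel.
image = [
    [0.1, 0.1, 0.9, 0.9],
    [0.1, 0.6, 0.9, 0.9],
    [0.1, 0.1, 0.9, 0.9],
]
scribbles = {"ground": [(0, 1)], "desk": [(3, 1)]}
smooth = label_scene(image, scribbles, beta=0.1)
rough = label_scene(image, scribbles, beta=0.0)
print(smooth[1][1], rough[1][1])  # ground desk
```

With the smoothness weight set to zero the noisy pixel follows its appearance alone, while a small positive weight lets its neighbours pull it into the surrounding region.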

Figure 19 (a) shows the trajectories of the agent’s hands and feet. Figure 19 (b) shows the segmentation result obtained from the trajectories. The ground is successfully segmented by the trajectories of the feet. The keyboard, phone, and microwave are segmented by the concentrated trajectories of the hands.

Segments 4 and 5 in Figure 19 (b) are too large to be objects of interest, so we prune them.
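A size-based pruning step of this kind can be sketched as below. The area threshold `max_fraction`, the background id and the function name `prune_large_segments` are assumptions made for the example; the text does not state the exact criterion used.

```python
from collections import Counter


def prune_large_segments(segment_map, max_fraction=0.3, background=0):
    """Reset to background every segment covering more than max_fraction of the image."""
    counts = Counter(v for row in segment_map for v in row)
    total = sum(counts.values())
    too_large = {s for s, n in counts.items()
                 if s != background and n / total > max_fraction}
    pruned = [[background if v in too_large else v for v in row]
              for row in segment_map]
    return pruned, too_large


# Toy segment map: segment 4 covers 8 of 12 pixels, segments 1-3 are small.
segment_map = [
    [4, 4, 4, 1],
    [4, 4, 4, 2],
    [4, 4, 0, 3],
]
pruned, removed = prune_large_segments(segment_map)
print(removed)  # {4}
```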

Figure 19 (d) shows the final segmentation result for the objects of interest in the scene.

6.  Discussion  and  Conclusion 

In summary, we propose a prototype system for event learning, which explores all activities that happen in a certain environment and organizes them in a meaningful way by a hierarchical event dictionary and a stochastic T-AOG. The learned T-AOG can be used to parse newly observed videos to recognize events. We also show a promising application where it is used to discover scene semantics without manual labeling of the scene. We are working towards applying it to more diverse data sets and obtaining a richer T-AOG.

We present an algorithm for parsing video events with goal inference and intent prediction. Our experimental results show that events, including those involving multiple agents and those happening concurrently, can be recognized accurately, and that the ambiguity in the recognition of atomic actions can be largely reduced by using hierarchical event contexts.


Future work. The objects of interest in the scene are detected semi-automatically at present. The event context provides a lot of information about the objects involved in the event, and can be utilized to detect and recognize objects. We are actively pursuing further progress in the following aspects:

• Using Kinect data to better define agent poses in the 3D setting.

• Clustering more specific actions. An action can be defined as a set of typical configurations, each specified by a number of spatial interactions between agent and environment. E.g., a sitting pose can be specified by interactions between the person’s body and the chair, hand and keyboard, body and desk, etc.

• Using n-ary relations to handle group activities.